Effects of Item Characteristics on Test Fairness


Mental  ability  testing  is  commonly  used  in  education,  industry  and  the 
military  services  to  select  and  place  individuals.  Test  results  are  also 
used  in  research  as  a basis  for  making  inferences  about  the  intellectual 
endowment  of  various  individuals  and  subgroups.  However,  many  of  these 
tests  have  often  been  cited  as  being  biased  and/or  unfair  to  certain  subgroups 
of  the  general  population,  including  Blacks,  Spanish-speaking  Americans  and 
Native  Americans.  Because  of  the  prevalence  of  testing  in  our  society 
and  because  of  the  possible  discriminatory  nature  of  some  tests,  there  has 
recently  been  an  increase  in  research  on  the  nature  and  degree  of  test  bias  and 
test  fairness  in  various  settings,  including  examination  of  various  ways  to 
reduce  test  bias  and  unfairness  where  they  exist. 

A necessary  prerequisite  for  carrying  out  meaningful  research  in  this  area 
is  to  define  exactly  what  is  meant  by  bias  and  unfairness.  Over  the  last 
ten  years,  a number  of  models  have  been  proposed  to  provide  such  definitions. 

Many of these models are quite different in philosophy and purpose. A useful
taxonomy, often suggested (Flaugher, 1974; McNemar, 1975; Pine & Weiss, 1976),
is to separate models of bias from models of fairness. The essential distinction
is  that  models  of  bias  represent  the  psychometric  properties  of  a particular  set 
of  test  items  or  test  scores.  Models  of  test  fairness  typically  are  concerned 
with  the  impact  a test  will  have  when  used  in  a particular  application.  The 
application  most  often  considered  is  the  selection  or  placement  of  personnel. 

However,  there  is  a direct  relationship  between  the  item  characteristics 
of  a test,  including  the  degree  of  item  bias,  and  its  fairness  when  used  in  a 
selection  program.  Although  substantial  amounts  of  research  have  dealt  with 
the  effects  of  item  characteristics  on  test  validity  (Brogden,  1946;  Gulliksen, 
1945;  Tucker,  1946;  Urry,  1969),  no  efforts  have  been  made  to  study  the  effects 
of  item  characteristics  on  test  fairness.  Even  for  validity,  the  effects  of 
possible  bias  in  the  test  items  have  not  been  considered. 

There  are  a number  of  possible  reasons  for  this  lack  of  research.  First, 
selection fairness models are relatively new. Second, empirical investigation
in this area is often expensive, impractical due to the relative unavailability
of minority group members, and hampered by the absence of a suitable, unbiased
criterion measure. Furthermore, in selection fairness models, tests are
considered  only  in  terms  of  their  final  scores.  Therefore,  the  internal 
properties  of  a test  are  generally  ignored.  This  approach  is  detrimental  to 
the  development  of  tests  which  might  be  designed  to  reduce  unfairness. 

This  report  offers  a general  method  for  examining  the  relationship  between 
selection  and  placement  fairness  and  the  characteristics  of  test  items.  This  is 
accomplished  by  conceptualizing  bias  and  fairness  in  terms  of  latent  trait  theory. 
Criterion  performance  is  represented  by  the  latent  trait.  Item  bias  and  other 
item  characteristics  are  expressed  in  terms  of  latent  trait  parameters.  This 
approach eliminates the possibility that the criterion itself may be biased, and
permits  direct  observation  of  how  the  characteristics  of  a test  affect  the 
prediction  of  a criterion  and,  in  turn,  selection  fairness. 
Bias and Fairness

Bias, as it is used in this report, refers to those subgroup differences in
the psychometric properties of a test which occur as a result of factors
extraneous to those which a test is intended to measure. For example, mean
test score differences between Blacks and Whites on a vocabulary test would be
considered evidence of bias if these differences reflected the influence of
cultural factors. In this case, the cultural factors would be extraneous
since, presumably, the test is intended to measure verbal ability.

Most of the models of bias which have been proposed (Angoff & Ford, 1973;
Breland, Stocking, Pinchak, & Abrams, 1974; Echternacht, 1974) have involved
comparing item difficulties among subgroups. According to this approach, a
test is considered biased if its items do not have roughly the same relative
difficulties for all subgroups. An item within the test is said to be biased
if it is relatively more difficult for a given subgroup than are most of the
other test items. Other models of bias which have been proposed involve
subgroup comparison of item discriminations, mean test scores, and factor
loadings (e.g., Angoff, 1975; Atkin, Bray, Davison, Herzberger, Humphreys,
& Selzer, 1976; Jensen, 1975).

Regardless  of  the  specific  model  used,  the  existence  of  bias  cannot  by 
itself be taken as prima facie evidence that a test is unfair. For example, a
test  which  includes  a substantial  proportion  of  Black  slang  words  may  be  unfair 
when  used  to  select  college  freshmen,  but  fair  when  used  to  select  social 
workers  for  employment  in  the  Black  community.  Clearly  then,  the  fairness 
of  a test  (or  test  item)  can  only  be  determined  by  examining  what  caused  the 
bias  and  what  its  eventual  impact  will  be  in  a specific  application. 

For  the  specific  application  of  tests  to  the  selection  of  personnel,  a 
number  of  formal  definitions  of  fairness  have  been  developed.  One  of  the 
earliest  formal  definitions  of  test  fairness  in  selection  was  based  on  the 
concept  of  validity.  This  is  undoubtedly  due  to  the  fact  that  early  legal 
challenges  to  the  use  of  tests  for  personnel  selection  questioned  test  validity. 

Validity  model  of  test  fairness.  The  validity  model  is  primarily 
concerned  with  the  legitimacy  of  the  inferences  which  can  be  made  about  people’s 
ability  or  performance  in  a specific  situation  based  on  their  test  scores.  The 
validity  of  a test  is  frequently  determined  by  calculating  the  correlation 
coefficient  between  the  test  scores  and  scores  on  an  appropriate  criterion  for 
a particular  subgroup.  Fairness  of  a testing  procedure  has  been  evaluated  in 
terms of whether there is a significant difference between the validity coefficients
for various subgroups on a given test. If a significant difference
does  exist,  this  would  imply  that  the  predictions  made  on  the  basis  of  the 
test  scores  are  not  as  accurate  for  one  subgroup  as  for  another. 

In  a selection  situation,  such  a difference  in  validity  would  have  several 
adverse  effects  on  the  subgroup  having  the  lower  correlation.  First,  it  would 
decrease  the  variance  of  the  predicted  score  distribution.  Assuming  the  selection 
cutoff  to  be  above  the  mean  of  this  subgroup,  as  it  normally  would  be,  such  a 
decrease  in  variance  would  lower  the  probability  that  these  individuals  would 
exceed the selection cutoff. Secondly, the lower correlation coefficient indicates
that the test does not order individuals as accurately on the criterion
as  it  would  for  a subgroup  having  a higher  validity  coefficient.  Consequently, 
if  selection  is  based  on  predicted  criterion  performance,  applicants  with  lower 
average  ability  will  be  selected  from  the  subgroup  having  the  lower  predictive 
validity  even  in  cases  where  the  subgroups  have  equal  mean  ability. 
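
The first of these effects can be illustrated numerically. The following is a minimal sketch, assuming a standardized, normally distributed criterion so that regression-predicted scores have a standard deviation equal to the validity coefficient. The function name `prob_above_cutoff` and the example values are illustrative and do not come from this report.

```python
from statistics import NormalDist

def prob_above_cutoff(validity, cutoff):
    """P(predicted score > cutoff) for a standardized criterion.

    Assumes predicted scores are normal with mean 0 and standard
    deviation equal to the validity coefficient (illustrative assumption).
    """
    predicted = NormalDist(mu=0.0, sigma=validity)
    return 1.0 - predicted.cdf(cutoff)

# Same cutoff (half a criterion SD above the mean), two validity levels.
high_validity = prob_above_cutoff(0.6, 0.5)
low_validity = prob_above_cutoff(0.3, 0.5)
print(round(high_validity, 3), round(low_validity, 3))
```

With the cutoff above the subgroup mean, the subgroup with the lower validity has the smaller probability of exceeding it, which is the variance effect described above.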

Whether  or  not  meaningful  validity  differences  among  subgroups  occur  in 
real  selection  situations  is  an  empirical  issue  which  has  received  a great  deal 
of attention recently. The weight of the evidence (Campbell, Crooks, Mahoney,
& Rock, 1973; Farr, O'Leary, Pfeiffer, Goldstein, & Bartlett, 1971; Schmidt,
Berner, & Hunter, 1973) seems to indicate that meaningful differences occur with
very  low  frequency.  However,  a number  of  issues  remain  unresolved  regarding 
how  to  statistically  test  for  a subgroup  validity  difference  and  what  to  do  if 
it is statistically significant (e.g., Standards for Educational and Psychological
Tests, 1974; Flaugher, 1974).

Although  research  still  continues  on  differential  validity  as  a means  of 
evaluating  test  fairness,  it  appears  that  validity  is  a necessary  but  not  a 
sufficient  condition  for  test  fairness.  In  recognition  of  this  fact,  a number 
of  specific  models  have  been  proposed  for  defining  fairness  in  the  context  of 
selection. 

Other  models  of  fairness.  In  the  context  of  selection,  test  fairness  is 
directly  interpretable  in  terms  of  the  number  of  applicants  who  are  selected 
from  each  subgroup  of  testees.  Test  bias  influences  fairness  to  the  extent 
that if a test is biased, it will often produce an adverse impact on the subgroup
against which it is biased. This, however, will depend on how fairness
is  defined  and  on  other  situational  variables,  such  as  the  criterion  for  success 
and  selection  cutoff  points. 

When  a test  is  used  in  the  selection  process,  it  is  part  of  a decision 
strategy  to  select  or  reject  potentially  successful  individuals  for  one  or  more 
available openings. Operationally, this is usually achieved by setting a cutoff
score on the criterion to define successful performance, determining the
corresponding  predictor  cutoff  scores,  and  selecting  applicants  with  predictor 
scores  equal  to  or  exceeding  the  predictor  cutoff  score. 

It  was  previously  indicated  that  a low  test  validity  for  a given  subgroup, 
which  is  equivalent  to  a larger  amount  of  random  errors  of  prediction,  can 
affect  selection  decisions  by  decreasing  the  probability  that  individuals  from 
that  subgroup  would  exceed  a given  cutoff  on  the  criterion.  Another  factor  which 
would  affect  the  prediction  of  criterion  performance  in  selection  is  constant 
errors of prediction. The random and constant errors of prediction can be respectively
translated by regression theory into the slope and intercept of the
regression  line  relating  test  scores  to  criterion  performance. 

Cleary  (1968)  developed  a widely  used  definition  of  selection  fairness, 
referred  to  by  her  as  'bias',  which  involves  the  regression  line  in  prediction. 
According to Cleary, "A test is biased for members of a subgroup of the population
if, in the prediction of a criterion for which the test was designed,
consistent nonzero errors of prediction are made for members of the subgroup..."
(Cleary, 1968, p. 115). Theoretically, consistent zero errors of prediction are
assured by employing separate within-subgroup regression lines, i.e., differential
prediction. Therefore, the application of Cleary's definition is operationally
equivalent to endorsing differential prediction in selection.

This fact can be demonstrated by considering the situation in Figure 1.
Figure 1a illustrates the situation in which the mean criterion score for the
minority subgroup ($\bar{Y}_{min}$) is equal to the mean criterion score for the
majority subgroup ($\bar{Y}_{maj}$), but the mean test score of the majority
subgroup ($\bar{X}_{maj}$) is greater than the score ($\bar{X}_{min}$) of the
minority subgroup. In this situation it is clear that use of within-subgroup
regression lines, i.e., differential prediction, will produce consistent zero
errors of prediction for both the minority and majority subgroups. However,
using either the regression line of the majority subgroup or the regression line
derived from data pooled across both subgroups will lead to underprediction of
the minority subgroup.

A situation more commonly found in extant practice (Cleary, 1968; Gael,
Grant, & Ritchie, 1975; Goldman & Richards, 1974; Kallingal, 1971; Temp, 1971)
is where subgroups differ on both the criterion and test scores, as shown in
Figure 1b. In this case, using either the majority or pooled regression line
to  predict  minority  criterion  performance  will  result  in  overprediction  for 
members  of  that  subgroup. 
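
The two situations in Figure 1 can be checked with a small worked example. This sketch assumes a hypothetical majority regression line with slope 0.5 passing through majority means of 60 (test) and 50 (criterion); all numbers and the function name `mean_prediction_error` are illustrative. Error is taken as predicted minus actual, so a negative value means underprediction.

```python
def mean_prediction_error(slope, line_x_mean, line_y_mean, x_mean, y_mean):
    """Mean error (predicted minus actual) when a regression line passing
    through (line_x_mean, line_y_mean) is applied to a subgroup whose
    test and criterion means are (x_mean, y_mean)."""
    intercept = line_y_mean - slope * line_x_mean
    predicted_mean = intercept + slope * x_mean
    return predicted_mean - y_mean

# Figure 1a case: minority has the same criterion mean but a lower test mean.
error_1a = mean_prediction_error(0.5, 60.0, 50.0, x_mean=50.0, y_mean=50.0)
# Figure 1b case: minority is lower on both the test and the criterion.
error_1b = mean_prediction_error(0.5, 60.0, 50.0, x_mean=50.0, y_mean=40.0)
print(error_1a, error_1b)  # -5.0 5.0
```

Under these assumed values the majority line underpredicts the minority subgroup in the first case and overpredicts it in the second.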

In  recent  years,  a number  of  models  have  been  proposed  as  alternatives  to 
Cleary's  regression  model  of  selection  fairness  (see  Cole,  1973;  and  Petersen  & 
Novick,  1974,  for  reviews).  The  one  most  frequently  offered  as  an  alternative 
to  Cleary's  model  is  Thorndike's  (1971)  Constant  Ratio  model.  According  to 
Thorndike,  fair  use  of  test  scores  requires  that  the  acceptance  levels  should  be 
set  such  that  the  ratio  of  the  percentage  of  individuals  who  exceed  a specified 
level of criterion performance to the percentage who exceed a cutoff on the predictor
will be equalized among subgroups in the applicant population.
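
The Constant Ratio condition can be sketched as a small computation. The function `constant_ratio` and the scores below are hypothetical and serve only to show how the ratio is formed for each subgroup. The sketch assumes at least one individual exceeds the predictor cutoff, since otherwise the ratio is undefined.

```python
def constant_ratio(criterion_scores, predictor_scores,
                   criterion_level, predictor_cutoff):
    """Proportion exceeding the criterion level divided by the proportion
    exceeding the predictor cutoff, for one subgroup."""
    n = len(criterion_scores)
    successful = sum(y > criterion_level for y in criterion_scores) / n
    selected = sum(x > predictor_cutoff for x in predictor_scores) / n
    return successful / selected

# Hypothetical scores for two small subgroups with equal criterion scores.
maj_ratio = constant_ratio([1, 2, 3, 4], [1, 2, 3, 4], 2, 2)  # 0.50 / 0.50
min_ratio = constant_ratio([1, 2, 3, 4], [0, 1, 2, 3], 2, 2)  # 0.50 / 0.25
print(maj_ratio, min_ratio)  # 1.0 2.0
```

Because the two ratios differ, this hypothetical use of the test would not be fair under Thorndike's definition; the predictor cutoff for the minority subgroup would have to be lowered until the ratios matched.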

One of the primary conclusions that has derived from the research on test
fairness  is  that  the  assessment  of  fairness  will  depend  on  how  fairness  is 
defined.  Some  of  the  models  that  have  been  proposed  will  lead  to  the  selection 
of  more  minority  applicants  than  will  other  models.  If  the  models  are  ordered 
along  the  dimension  of  how  many  minority  applicants  are  selected  in  a given 
situation,  the  Cleary  and  Thorndike  models  fall  near  the  extremes.  The  Cleary 
model  is  the  least  favorable  to  minority  subgroups,  while  the  Thorndike  model  is 
one  of  the  most  favorable.  Consequently,  these  two  models  make  a convenient 
pair  of  strategies  for  evaluating  the  fairness  of  a test. 

Purpose  and  Assumptions 

Purpose. In their book on mental test theory, Lord and Novick (1968, p.
388) indicate how the item characteristics of a test can affect the shape of the
distribution of test scores. As can be seen in Figure 1, selection fairness is
a function of the parameters of the distribution of test scores. Therefore, if
the item characteristics of a test can affect the shape of the test score
distribution, they will also influence selection fairness. The purpose of this
report is to examine the relationship between characteristics of test items and
selection fairness, as reflected by several fairness models.


Figure 1. Relationships between criterion scores and test scores for majority
and minority subgroups with unequal mean scores on the predictor variables.
Panel (a): Equal Criterion Means. Panel (b): Unequal Criterion Means.
Horizontal axis in both panels: Test Scores (X), with the subgroup means
$\bar{X}_{min}$ and $\bar{X}_{maj}$ marked.
Specifically, the following questions are investigated:

1.  How  do  the  following  characteristics  of  test  items  affect  fairness? 

a.  Distribution  of  item  difficulties. 

b.  Level  of  item  discrimination. 

c.  Degree  of  item  bias. 

2.  How  is  fairness  affected  by  test  length? 

3.  How does the assessment of fairness depend on the choice of a model for
measuring fairness?

Answers  to  these  questions  should  be  useful  in  indicating  how  a fair  test  should 
be  constructed. 

Assumptions. The above questions were investigated in the context of an
assumed selection situation which was modeled by a Monte Carlo simulation study.
The selection process consisted of administering a selection test to each
applicant and using the score from that test to predict an external criterion
represented by the known latent trait, $\theta$. The applicant population was assumed to
consist of two subgroups having identical ability distributions on $\theta$. The
selection instrument was assumed to be completely described in terms of its
latent trait parameters so that each of its items could be described in terms of
item discrimination, item difficulty, and probability of being guessed
correctly by chance. Some of the items in the test, however, were assumed to be
biased against the minority subgroup and the degree of their bias was expressed
in terms of the latent-trait item parameters.


METHOD 


Independent  Variables 

Four of the independent variables were characteristics of the test
administered to both the majority and minority subgroups simulated in this study.
Three of these variables (distribution of item difficulties, level of item
discrimination, and test length) are standard characteristics of tests. The
fourth, item bias, reflected the major independent variable of interest in this
study. The fifth independent variable was intended to vary the fairness in the
application of test scores. This variable consisted of using either the regression
equation from the majority subgroup only or differential prediction for the
prediction of a simulated criterion variable. Figure 2 summarizes the independent
variables used in this study.

Test Variables


Only conventional tests were used in this study. That is, all simulated
testees within an experimental condition were administered identical items in a
fixed sequence. Test items were represented by a set of latent trait parameters
(Lord & Novick, 1968, p. 366) which described the essential statistical
properties of each item. A test of length m with a given set of characteristics was
generated by selecting the first m items from one of eighteen 100-item pools.


Figure 2. Independent Variables. [Design diagram: each prediction method
(Majority Prediction, Differential Prediction) is crossed with the distribution
of item difficulties (Uniform, Peaked), discrimination a (3 levels), degree of
item bias (3 levels), and test length (5 lengths).]


Each item pool represented one of the experimental conditions obtained from
taking combinations of the three test variables summarized in Table 1. For all
experimental conditions the guessing parameter, c, was set at .20. This value
is the expected proportion correct if purely random guessing occurred on
five-alternative multiple-choice items.


Table 1
Item Pool Parameter Specifications

Distribution of Difficulties      a      Bias
Uniform or Peaked                .30      .5
Uniform or Peaked                .30     1.0
Uniform or Peaked                .30     2.0
Uniform or Peaked                .70      .5
Uniform or Peaked                .70     1.0
Uniform or Peaked                .70     2.0
Uniform or Peaked               1.10      .5
Uniform or Peaked               1.10     1.0
Uniform or Peaked               1.10     2.0

Distribution  of  item  difficulties.  Tests  were  simulated  which  had  either 
peaked  or  uniform  distributions  of  item  difficulties.  The  peaked  distributions 
of  difficulties  (b)  were  randomly  sampled  from  a normal  distribution  having  a 
mean  of  b=0  (where  0 indicates  an  item  of  average  difficulty)  and  a standard 
deviation of 1.0. The uniform distributions of difficulties also had a mean of
b=0 but were randomly sampled from a uniform distribution which ranged from
b=-2.99 to +2.99. The actual distribution of item difficulties used in each
condition  is  summarized  in  Table  A in  the  Appendix. 
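
The two sampling schemes for item difficulties can be sketched as follows. This is an illustration only: it uses Python's `random` module with an arbitrary seed, not the generator used in the study, and the function name `sample_difficulties` is hypothetical.

```python
import random

def sample_difficulties(kind, n_items=100, seed=0):
    """Draw a pool of item difficulties (b) for one test condition.

    'peaked'  : normal distribution, mean 0, standard deviation 1.0
    'uniform' : uniform distribution from -2.99 to +2.99
    """
    rng = random.Random(seed)  # arbitrary seed, for reproducibility only
    if kind == "peaked":
        return [rng.gauss(0.0, 1.0) for _ in range(n_items)]
    if kind == "uniform":
        return [rng.uniform(-2.99, 2.99) for _ in range(n_items)]
    raise ValueError("kind must be 'peaked' or 'uniform'")

peaked_pool = sample_difficulties("peaked")
uniform_pool = sample_difficulties("uniform")
print(len(peaked_pool), min(uniform_pool), max(uniform_pool))
```

A test of a given length would then take the first m values from the appropriate 100-item pool.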


Item  discriminations.  Three  levels  of  item  discrimination  were  used  within 
both the peaked and uniform tests. These three levels were a=.30, .70 and 1.00,
corresponding to point-biserial correlations of items with total scores of .127,
.373 and .482, respectively (assuming a population proportion passing of P=.6
and a guessing parameter of c=.2). Values of item discrimination were held
constant  within  each  testing  condition  and  subgroup. 


Test  length.  To  study  the  effects  of  test  length  and  its  interaction 
with  item  difficulty  distributions,  item  discrimination,  and  item  bias  on  test 
fairness, five typical test lengths were used. Test lengths were 10, 30, 50,
70 and 100 items. Within each test length, discriminations were constant for a
given peaked or uniform test and a specified degree of item bias.


Item bias. Item bias was defined as

$$\text{Bias} = b_{min} - b_{maj} \tag{1}$$

where $b_{maj}$ and $b_{min}$ are the latent trait difficulty parameters for the majority
and minority subgroups, respectively.


This  definition  of  item  bias  was  based  on  the  assumption  that  the  subgroups 
had  identical  true  ability  distributions  on  the  trait  being  measured,  but  that 
items  were  more  difficult  for  the  minority  subgroup  because  of  some  independent 
extraneous  factor(s)  which  reduced  their  performance  on  the  test  items.  For 
example,  if  a test  was  designed  to  measure  verbal  ability,  the  inclusion  of 
"culturally  loaded"  items  would  result  in  a test  which  would  be  more  difficult 
for  a nondominant  subgroup  of  a given  culture.  The  result  would  be  a test  which 
would  be  biased  against  such  minority  subgroups.  This  definition  of  item  bias  is 
very similar to those often applied in practice (Angoff & Ford, 1973; Breland
et al., 1974; Echternacht, 1974). The main difference is that previous models
of item bias have been based on the proportion correct measure of item difficulty.
However, proportion correct has been shown (Lord & Novick, 1968; Urry, 1974) to
be confounded with guessing and item discrimination, whereas latent trait
difficulty parameters are pure measures of item difficulty.


Three levels of item bias, based on Equation 1, were studied. These were
.5, 1.0 and 2.0, indicating tests which were progressively more difficult for
members of the simulated minority subgroup. Bias was introduced into the tests
by adding this constant value to the difficulty parameters of the items selected
to constitute the majority subgroup test. Item discrimination, guessing and
test length were held constant as bias was introduced into the testing situation.
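
Introducing bias by a constant shift in difficulty can be sketched in a few lines. The item difficulties below are hypothetical; only the bias levels (.5, 1.0, 2.0) come from the text, and the function name `minority_difficulties` is illustrative.

```python
def minority_difficulties(majority_b, bias):
    """Add a constant bias to each majority-subgroup difficulty parameter."""
    return [b + bias for b in majority_b]

majority_b = [-1.0, 0.0, 1.5]   # hypothetical item difficulties
minority_b = minority_difficulties(majority_b, bias=0.5)
print(minority_b)  # [-0.5, 0.5, 2.0]
```

Every item is shifted by the same amount, so the item difficulties keep their relative order and spacing within each subgroup; only their overall level differs between subgroups.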


Prediction  of  Latent  Ability 


A raw test score was obtained for each simulated testee by summing the number
of correct answers for that testee. Correct answers to the pth item were
recorded as $v_p = 1$, while incorrect answers were represented as $v_p = 0$. Therefore,
the raw test score for the ith individual was

$$X_i = \sum_{p=1}^{m} v_p \tag{2}$$

where m = test length.

Since the objective of the test was to obtain an estimate of the latent
ability $\theta$, a method was needed to obtain a prediction of $\theta$ based on the test
score $X_i$. Linear regression equations were used for this purpose. Two kinds of
regression equations, majority and differential prediction, were used, corresponding
to two types of prediction procedures often mentioned in the literature
(Bartlett & O'Leary, 1969; Goldman & Hewitt, 1975; Jones, 1973; McNemar,
1975). One regression equation of each type (majority prediction and differential
prediction) was developed within each of the eighteen testing conditions. The
predicted ability scores generated by these regression equations were used to
define the dependent variables.


Majority prediction. In this condition, the same regression equation

$$\hat{\theta}_i = \alpha + \beta X_i \tag{3}$$

where $\alpha$ and $\beta$ are the regression parameters based on only the data from the
majority subgroup, was used to predict the ability of all individuals regardless
of subgroup membership.
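
Majority prediction can be sketched with an ordinary least-squares fit. The data below are hypothetical and the helper `fit_line` is illustrative; the sketch shows only that the intercept and slope are estimated from majority data and then applied unchanged to every testee.

```python
def fit_line(x, y):
    """Ordinary least-squares intercept and slope for predicting y from x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

# Hypothetical raw scores and true abilities for a few majority testees.
maj_scores = [2, 4, 6, 8]
maj_theta = [-1.5, -0.5, 0.5, 1.5]
alpha, beta = fit_line(maj_scores, maj_theta)

# The same equation is then applied to minority testees as well.
min_scores = [1, 3, 5, 7]
min_theta_hat = [alpha + beta * x for x in min_scores]
print(alpha, beta, min_theta_hat)
```

Differential prediction would instead call `fit_line` once per subgroup and use each subgroup's own intercept and slope.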


Differential prediction. In this condition, separate within-subgroup
regression equations were used to predict ability for individual i of subgroup j.
These are given by

$$\hat{\theta}_{ij} = \alpha_j + \beta_j X_{ij} \tag{4}$$

where $\alpha_j$ and $\beta_j$ were the within-subgroup regression parameters for subgroup j,
where j referred to either the majority or minority subgroup.

Dependent  Variables 


The dependent variable in this study was test fairness. Fairness was
evaluated by three indices separately for each of the 180 combinations of
independent variables (i.e., item difficulty distribution x item discrimination x
test length x bias x prediction method). The three fairness indices were: 1) a
validity index, R; 2) a Cleary-type index, C; and 3) a Thorndike-type index, T.
These fairness measures parallel their original definitions, but in this study
the variable being predicted was $\theta$, the known true latent ability, as compared to
the fallible external criterion usually used in research on test fairness.

In addition to studying the effects of item bias and other test characteristics
on these three definitions of fairness, the effects of the independent
variables on a number of standard distributional statistics were also studied.
These included the mean, standard deviation, standard error of estimate, skewness,
and kurtosis of the ability estimates, $\hat{\theta}$.

The R-Index


The correlation between estimated ability and the true latent ability
has been used in latent trait studies as a measure of the "goodness" of ability
estimation (Brogden, 1946; Urry, 1969, 1971). In the present study the true
ability, $\theta$, was taken as the criterion for selection. Therefore, this correlation can be
interpreted as a coefficient of predictive validity. For simplicity, this
coefficient of validity will be referred to simply as the R-Index.


Differences  in  R between  the  majority  and  minority  subgroups  were  examined 
as  an  indication  of  test  fairness.  Larger  correlations  for  one  group  as  compared 
to  the  other,  holding  testing  conditions  constant,  would  indicate  that  a given 
set of testing conditions produced test scores with a greater potential for
unfairness for the group having the lower correlation.


R was evaluated only for the majority prediction condition since the application
of differential prediction amounts to a linear transformation of the
majority  prediction  ability  estimates,  and  correlation  coefficients  are 
unaffected  by  linear  transformations. 
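
The invariance argument can be checked numerically. In this sketch the ability values, the majority-equation predictions, and the coefficients of the linear transformation are all hypothetical; the point is only that a linear transformation with positive slope leaves the correlation with true ability unchanged.

```python
import math

def pearson_r(x, y):
    """Pearson product-moment correlation between two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

theta = [-1.2, -0.4, 0.1, 0.7, 1.5]               # hypothetical true abilities
theta_hat_majority = [-0.9, -0.5, 0.2, 0.3, 1.1]  # hypothetical predictions

# Within one subgroup, both equations are linear in the same raw score, so the
# differential predictions are a linear function of the majority predictions.
theta_hat_differential = [0.3 + 1.4 * t for t in theta_hat_majority]

r_majority = pearson_r(theta_hat_majority, theta)
r_differential = pearson_r(theta_hat_differential, theta)
print(r_majority, r_differential)
```

A negative slope would reverse the sign of the correlation, so the argument relies on both regression slopes being positive.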

The  C-Index 


Based on Cleary's (1968) concept of test bias, the degree of test bias in
subgroup j can be defined as

$$C_j = \bar{\hat{\theta}}_j - \bar{\theta}_j \tag{5}$$

where $\bar{\hat{\theta}}_j$ and $\bar{\theta}_j$ are the means of the predicted and
true ability distributions, respectively. When this definition is applied to the
predicted abilities obtained from the differential prediction equation given in
Equation 4, $C_j = 0$ in all cases. This follows since $\bar{\hat{\theta}}_j$ will always equal $\bar{\theta}_j$.
Consequently, the utilization of differential prediction will always result in a
fair test usage according to the Cleary definition.
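
The claim that within-subgroup regression always gives a Cleary index of zero follows from a property of least squares: the mean of the fitted values equals the mean of the variable being predicted. The sketch below illustrates this with hypothetical minority data and a hypothetical majority equation; the helper names are illustrative.

```python
def fit_line(x, y):
    """Ordinary least-squares intercept and slope for predicting y from x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

def cleary_index(theta_hat, theta):
    """Mean predicted ability minus mean true ability for one subgroup."""
    return sum(theta_hat) / len(theta_hat) - sum(theta) / len(theta)

min_scores = [1, 2, 3, 4, 5]              # hypothetical raw scores
min_theta = [-1.0, -0.2, 0.1, 0.4, 1.2]   # hypothetical true abilities

# Differential prediction: the subgroup's own regression line.
alpha_min, beta_min = fit_line(min_scores, min_theta)
c_within = cleary_index([alpha_min + beta_min * x for x in min_scores], min_theta)

# Majority prediction: a line estimated elsewhere (hypothetical parameters).
alpha_maj, beta_maj = -2.0, 0.5
c_majority_line = cleary_index([alpha_maj + beta_maj * x for x in min_scores], min_theta)
print(c_within, c_majority_line)
```

Here `c_within` is zero up to floating-point rounding, while the majority line gives a nonzero (negative) index for the minority subgroup.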


The inter-subgroup difference in the Cleary index is

$$C_{diff} = (\bar{\hat{\theta}}_{min} - \bar{\theta}_{min}) - (\bar{\hat{\theta}}_{maj} - \bar{\theta}_{maj}) = C_{min} - C_{maj} \tag{6}$$

Since in the majority prediction condition (Equation 3)

$$\bar{\hat{\theta}}_{maj} = \bar{\theta}_{maj} \tag{7}$$

Equation 6 simplifies to

$$C_{diff} = C_{min} \tag{8}$$

Similarly, in the differential prediction condition (Equation 4)

$$\bar{\hat{\theta}}_{maj} = \bar{\theta}_{maj} \quad \text{and} \quad \bar{\hat{\theta}}_{min} = \bar{\theta}_{min} \tag{9}$$

and Equation 6 simplifies to

$$C_{diff} = 0 \tag{10}$$

for all cases. Consequently, the Cleary index, C, was also evaluated only in
the majority prediction condition.

The T-Index


Applying Thorndike's definition of fairness to the model used in this study,
a test is fair if the following condition is met:

$$\frac{P(\hat{\theta}_{maj} > \theta_o)}{P(\hat{\theta}_{min} > \theta_o)} = \frac{P(\theta_{maj} > \theta_o)}{P(\theta_{min} > \theta_o)} \tag{11}$$

where P is the proportion of testees who exceed the cutoff point $\theta_o$. In this
study a cutoff equal to the mean of the majority subgroup, i.e., $\theta_o = 0$, was used.

Since identical subgroup ability distributions were assumed, Equation 11
reduced to

$$\frac{P(\hat{\theta}_{maj} > \theta_o)}{P(\hat{\theta}_{min} > \theta_o)} = 1 \tag{12}$$

or

$$P(\hat{\theta}_{maj} > \theta_o) = P(\hat{\theta}_{min} > \theta_o) \tag{13}$$


If Equation 13 defines a fair selection situation, then the degree to which a
test is unfair to the minority subgroup, as compared to the majority subgroup,
is given by

$$T_{diff} = \left[ P(\hat{\theta}_{min} > \theta_o) - P(\hat{\theta}_{maj} > \theta_o) \right] \times 100 \tag{14}$$

or simply the difference between the percentage of individuals who exceed the
selection cutoff in the minority and majority subgroups. The T-Index was
evaluated in both the majority and differential prediction conditions.
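
The T-Index can be sketched as a difference in selection percentages. The predicted abilities below are hypothetical, the function name `t_index` is illustrative, and the order of subtraction (minority minus majority) is an assumption based on the wording above.

```python
def t_index(theta_hat_min, theta_hat_maj, cutoff=0.0):
    """Percentage of minority testees above the cutoff minus the
    percentage of majority testees above the cutoff."""
    p_min = sum(t > cutoff for t in theta_hat_min) / len(theta_hat_min)
    p_maj = sum(t > cutoff for t in theta_hat_maj) / len(theta_hat_maj)
    return (p_min - p_maj) * 100.0

majority_hat = [-1.0, -0.2, 0.3, 0.9]   # 2 of 4 exceed the cutoff
minority_hat = [-1.4, -0.6, -0.1, 0.5]  # 1 of 4 exceeds the cutoff
print(t_index(minority_hat, majority_hat))  # -25.0
```

Under this sign convention a negative value means that proportionally fewer minority than majority testees would be selected.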


Data  Simulation 

Population 


The  selection  of  examinees  from  a target  population  was  simulated  with  a 
computer  by  generating  500  random  numbers  which  fell  between  the  values  of  -3.34 
and  +3.24  sampled  from  a normal  population  having  a mean=0  and  a S.D.=1.0.  Each 
of the random numbers represented the true ability, $\theta$, for one testee: $\theta$=0
indicated an individual of average ability, while $\theta$=3.0 indicated a person of
very high ability on the relevant trait. Since the same population distribution
was used for both the majority and minority subgroups, the degree of unfairness
which occurred as a result of the characteristics of the test items would be
manifested as differences between the predicted distributions of $\theta$ for the two
subgroups. Similarly, the same 500 values of $\theta$ were used within each of the 90
experimental  conditions.  In  this  way,  differences  observed  in  the  dependent 
variables  could  be  attributed  solely  to  action  of  the  independent  variables. 
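
A minimal Python sketch of this sampling step is given below. The random number generator, the seed, and the function name are assumptions; the original program and its generator are not described in the text. The point of the sketch is only that a single set of true abilities is drawn once and reused for both subgroups and every condition.

```python
import random


def sample_true_abilities(n_examinees=500, seed=1977):
    """Draw true abilities from a standard normal population, N(0, 1)."""
    rng = random.Random(seed)  # fixed (arbitrary) seed so the draw is repeatable
    return [rng.gauss(0.0, 1.0) for _ in range(n_examinees)]


# One set of abilities shared by both subgroups and all experimental conditions,
# so that later differences cannot come from the ability distributions themselves.
theta = sample_true_abilities()
theta_majority = theta
theta_minority = theta
```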

Simulation  Procedure 

The procedure used to simulate testing was carried out in three stages:
1) response vector generation, 2) application of test models to response vectors,
and 3) calculation of statistics and fairness indicants.


Response generation. Generation of test responses followed procedures
similar to those used by Betz & Weiss (1973), Vale & Weiss (1975) and McBride
& Weiss (1976). This procedure, based on latent trait test theory (Lord &
Novick, 1968), requires two assumptions. The first assumption was local
independence of responses, which requires that the probability that a testee of
ability $\theta$ will answer any item correctly is independent of whether that
testee answers any other item correctly. Stated mathematically, this assumption
becomes

$$f(v_1, v_2, \ldots, v_m \mid \theta) = \prod_{i=1}^{m} f_i(v_i \mid \theta) \qquad [15]$$

where $f$ and $f_i$ are probability density functions, $i$ refers to one of the $m$
items, and $v_i=0$ if a response was incorrect, and $v_i=1$ if correct.

The second assumption was that a response, $v_i$, depended only on 1) the
ability of the examinee, and 2) the characteristics of the test items, as
described by each item's latent trait parameters a, b and c.


With these assumptions, the response vectors were generated by:

1. Calculating $P_i(\theta)$, the probability of answering item i correctly given
   $\theta$, from the normal ogive version of the latent trait test model,

   $$P_i(\theta) = c_{ij} + (1 - c_{ij}) \int_{-\infty}^{L_{ij}(\theta)} \phi(t)\, dt \qquad [16]$$

   where $L_{ij}(\theta) = a_{ij}(\theta - b_{ij})$,
   $\phi(t)$ is the normal density function,
   j indicates subgroup membership (majority or minority), and
   $c_{ij} = .20$.

2. Determining the response by:

   a. Generating a random number, r, drawn from a uniform distribution,
      $0 < r < 1$.

   b. If $r > P_i(\theta)$, $v_i = 0$.

   c. If $r < P_i(\theta)$, $v_i = 1$.

3. Repeating this process for each item used, and for each subgroup. Two
   vectors of item responses were generated for each ability level for each
   item pool, one for the minority subgroup and one for the majority subgroup.
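
The three steps above can be sketched in Python as follows. This is an illustrative sketch, not the original program: the function names, the five-item pool, and the representation of item bias as a shift added to the minority subgroup's difficulty parameter are all assumptions made here for the example.

```python
import math
import random


def normal_cdf(x):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def p_correct(theta, a, b, c=0.20):
    """Normal ogive probability of a correct response with guessing parameter c."""
    return c + (1.0 - c) * normal_cdf(a * (theta - b))


def generate_responses(theta, items, rng):
    """Return a 0/1 response vector for one examinee; items is a list of (a, b, c)."""
    responses = []
    for a, b, c in items:
        r = rng.random()  # uniform random number in [0, 1)
        responses.append(1 if r < p_correct(theta, a, b, c) else 0)
    return responses


# Hypothetical 5-item pool. The minority difficulties are shifted by an
# assumed bias value; the source does not show how bias entered the parameters.
bias = 0.5
majority_items = [(0.7, b, 0.20) for b in (-1.0, -0.5, 0.0, 0.5, 1.0)]
minority_items = [(a, b + bias, c) for a, b, c in majority_items]

rng = random.Random(0)
majority_vector = generate_responses(0.0, majority_items, rng)
minority_vector = generate_responses(0.0, minority_items, rng)
```

Under this assumed form of bias, an examinee of a given ability has a lower probability of answering each item correctly when treated as a member of the minority subgroup.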

Test administration. The response vectors served as input to a program
which simulated the testing process. Since only conventional tests were
simulated in this study, the program selected items sequentially from one of the
eighteen combinations of item parameters. This process was repeated for each of
the five test lengths within each combination of the other sets of item parameters.
Varying test lengths were obtained by selecting the first m items out of the 100
items available, where m was the desired test length (10, 30, 50, 70 or 100 items).

Application of fairness models. The output of the second stage of the
simulation was an estimated $\theta$, $\hat{\theta}$, for each examinee for each
test condition. Therefore, a distribution of true and estimated $\theta$ values was
produced for each subgroup for each of the 90 experimental conditions. Within each
of these test conditions, the mean, standard deviation, skewness and kurtosis were
computed for the $\hat{\theta}$ variable, and the validity, Cleary, and Thorndike
measures of fairness (i.e., R, C, T) were calculated.
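
The descriptive statistics and the validity coefficient named here might be computed as in the following Python sketch. The exact formulas used in the study are not given in the text, so the moment-based skewness, the excess kurtosis (zero for a normal distribution, negative for flat distributions), and the division by n rather than n-1 are assumptions; validity is taken to be the Pearson correlation between true and estimated ability.

```python
import math


def describe(values):
    """Mean, SD, skewness, and excess kurtosis from central moments (divisor n)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    m4 = sum((v - mean) ** 4 for v in values) / n
    return {
        "mean": mean,
        "sd": math.sqrt(m2),
        "skewness": m3 / m2 ** 1.5,
        "kurtosis": m4 / m2 ** 2 - 3.0,  # 0 for a normal distribution
    }


def validity(theta, theta_hat):
    """Pearson correlation between true and estimated abilities."""
    n = len(theta)
    mean_t = sum(theta) / n
    mean_h = sum(theta_hat) / n
    s_tt = sum((t - mean_t) ** 2 for t in theta)
    s_hh = sum((h - mean_h) ** 2 for h in theta_hat)
    s_th = sum((t - mean_t) * (h - mean_h) for t, h in zip(theta, theta_hat))
    return s_th / math.sqrt(s_tt * s_hh)


summary = describe([1.0, 2.0, 3.0, 4.0, 5.0])
```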


RESULTS 

Distributions  of  Predicted  Scores 

Means, standard deviations, skewness, and kurtosis indices of ability
estimates as a function of the experimental conditions are given for a test length
of 50 items in Table 2; results for test lengths of 10, 30, 70 and 100 items,
which generally parallel those for 50 items, are given in Appendix Tables B through
E. In these tables, the statistics for the true ability distribution ($\theta$) are
given in the first row of the table, listed under the "True" group heading. In
the standard deviation column, values obtained when differential prediction (D.P.)
was used are given as well as values for the majority prediction (M.P.) case.
Since differential prediction did not affect any of the other statistics, only
one set of values is shown.

As Table 2 shows, increasing item bias caused the mean of the minority
subgroup to be underpredicted. The degree of underprediction increased both with
increasing item bias and with increasing item discrimination. For low item
discrimination, the degree of underprediction was less than the degree of item bias
introduced, with the degree of underprediction being somewhat larger for the
peaked test at each of the item bias levels. At high item discrimination (a=1.1),
the degree of underprediction became essentially equal to the degree of bias at
the .5 and 1.0 levels of item bias. With item bias equal to 2.0, the degrees of
underprediction (-1.85 and -1.52, for the uniform and peaked tests, respectively)
more closely approached the degree of bias than did the degrees of
underprediction (-1.34 and -1.10) in the low item discrimination condition at this
same bias level. Also, at the high item discrimination level, the degree of
underprediction was somewhat smaller for the peaked test at each item bias level.


Table 2

Score Distribution Characteristics for Conventional Tests of Length 50, as a
Function of Discrimination (a), Bias, and Group, for Uniform and Peaked Tests

                                        Standard Deviation
                      Mean            Uniform        Peaked         Skewness         Kurtosis
  a    Bias  Group  Uniform  Peaked   M.P.   D.P.   M.P.   D.P.   Uniform  Peaked  Uniform  Peaked

             True    -.076   -.076   1.006  1.006  1.006  1.006    -.01    -.01      .22     .22

 .30         maj     -.076   -.076    .798   .798   .807   .807     .03    -.11      .00    -.06
        .5   min     -.391   -.402    .822   .805   .811   .820     .06     .01     -.09    -.00
       1     min     -.709   -.738    .819   .810   .826   .822     .08     .10     -.16    -.02
       2     min    -1.336  -1.097    .819   .815   .800   .816     .28     .33      .10     .17

 .70         maj     -.076   -.076    .960   .960   .966   .966    -.08    -.11     -.27    -.66
        .5   min     -.696   -.535    .938   .939   .962   .949     .10     .19     -.33    -.66
       1     min     -.953   -.896    .929   .936   .903   .961     .31     .69     -.19    -.38
       2     min    -1.769  -1.708    .823   .920   .719   .896     .57    1.13      .08    1.08

1.1          maj     -.074   -.076    .967   .967   .959   .959    -.03    .10?     -.25    -.?3
        .5   min     -.516   -.536    .978   .965   .917   .957     .20     .36     -.26    -.93
       1     min    -1.020   -.956    .948   .960   .865   .937     .30     .86     -.33    -.07
       2     [values not recoverable]

The standard deviation of the ability distribution was generally
underpredicted, using majority prediction, both for the majority and minority
subgroups. For the uniform test, the degree of underprediction was reduced for both
groups as item discrimination increased, while for the peaked test, underprediction
increased for the minority subgroup while it decreased for the majority subgroup.
Within the peaked test, the degree of underprediction of the standard deviations
became especially severe with increasing item bias, at high discriminations.

When differential prediction was used, however, the degree of underprediction
of the standard deviation was substantially reduced. Even at a=1.1 for
the peaked test, underprediction of the standard deviation for the minority
subgroup was virtually the same as for the majority subgroup, except for very high
(2.0) levels of item bias.

The skewness for both the uniform and peaked tests increased in a positive
direction as both item bias and item discrimination increased. This effect was
much larger for the peaked test. At a=1.1 and bias of 2.0, the peaked test had
a skewness of 2.13 compared to .77 for the uniform test. The kurtosis measure
indicated that the shape of the distribution changed from being somewhat flat
(negative value) to being peaked (positive value) as item bias was increased;
the degree of this change was a function of increasing item discrimination.
Again, the uniform test, when compared to the peaked test, more closely
maintained its resemblance to the true normal distribution as bias was increased.

Fairness 


Validity: R-Index


Effects on majority subgroup. The validity coefficients for the uniform (U)
and peaked (P) distributions of item difficulties are shown in Table 3. The
three rows in Table 3 labeled "maj" give the validities for the majority
subgroup for the three values of item discrimination. These results correspond to
the case where item bias is zero.

Validity was found to increase as item discrimination and test length
increased for both types of item distributions. At the lower discrimination levels,
a=.30 and .70, the peaked distribution gave higher values of validity; but at the
high discrimination level, a=1.1, and for test lengths longer than about 40, the
advantage reversed and the uniform distribution gave higher validities. The
highest validity found was R=.981 for the uniform distribution of item
difficulties at a=1.1, for a test length of 100 items. The validity for peaked
tests at this same point was R=.967. The lowest validity also occurred for the
uniform distribution. At a=.30 for test length=10, R=.493, while R=.540 for the
peaked distribution.


Validity differences. A major concern with respect to test fairness refers
not only to how validity varies as a function of the test characteristics for a
given subgroup, but more importantly, how validity varies differentially among
subgroups. The reason for this is that if a difference in subgroup validities
does exist, this would imply that the predictions made on the basis of the test
scores are not as accurate for one subgroup as for the other. As was explained
in the Introduction, such a difference in validity would have several adverse
effects on the subgroup having the lower correlation. Therefore, the effects of
item bias on validity were studied by comparing the validities for both
subgroups for all the item pools and test lengths. To facilitate this analysis,
differences between subgroup validities were determined. Differential validity
was thus defined as

$$R_{diff} = R_{min} - R_{maj} \qquad [17]$$

A negative value of differential validity indicates that the majority subgroup
had a larger validity coefficient than the minority subgroup. These values
appear in Table 3 in the rows designated "diff".

Table 3 shows that for the lowest a-value, validity differences were very
small for low levels of item bias. As item bias increased, differential validity
increased for the uniform test, but decreased for the peaked test, except at 100
items where differential validity decreased for both tests. At a=.30,
differential validity tended to be positive in favor of the minority subgroup for
both types of tests. But for item discriminations of a=.7 and 1.1 for test lengths
of 30 and above, the direction of differential validity was reversed and the tests
became unfair to the minority subgroup. As the degree of item bias and item
discrimination increased, the size of this negative differential became
substantial, particularly for the peaked test. This effect was present at all test
lengths above 10 items. For example, the peaked test at a=1.1, test length=100,
and bias=2.0, had a .125 difference between the subgroup validities, in favor of
the majority subgroup. The largest negative differential validity for the
uniform tests was $R_{diff}$=-.028, which occurred at a=1.1, test length=50,
bias=2.0.


C-Index

The Cleary-type fairness measure, C, was defined as the difference between
the means of the true ability, $\theta$, and the predicted ability,
$\hat{\theta}$. Therefore, the C-Index is in the same units as $\theta$. The
population distribution of $\theta$ had a mean of 0 and a standard deviation of
1.0. A negative C-Index implies unfairness for a subgroup. In Figure 3, the
subgroup differences in the C-indices, ($C_{maj} - C_{min}$), are plotted against
test length for both the uniform and peaked tests for all item pools in the
majority prediction condition. As indicated by Equation 8, under the assumptions
of the present study, $C_{maj} - C_{min} = -C_{min}$, since $C_{maj}=0$. Numerical
values of C by subgroup are shown in Appendix Table F. C was not computed under
differential prediction since, as indicated earlier, by definition it is always
equal to zero in this condition.
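
The following Python sketch illustrates why C is zero by definition under differential prediction but not under majority prediction. It assumes that ability is predicted from test score by a least-squares regression line, fitted either on the majority subgroup alone (majority prediction) or separately within each subgroup (differential prediction). The function names and the toy scores are hypothetical and the numbers are not from the study.

```python
def fit_line(scores, theta):
    """Least-squares slope and intercept for predicting theta from test score."""
    n = len(scores)
    mean_x = sum(scores) / n
    mean_y = sum(theta) / n
    s_xx = sum((x - mean_x) ** 2 for x in scores)
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(scores, theta))
    slope = s_xy / s_xx
    return slope, mean_y - slope * mean_x


def c_index(theta, scores, line):
    """Mean predicted ability minus mean true ability (negative = underprediction)."""
    slope, intercept = line
    predicted = [slope * x + intercept for x in scores]
    return sum(predicted) / len(predicted) - sum(theta) / len(theta)


# Toy data: identical true abilities, but minority scores are one point lower.
theta = [-1.0, 0.0, 1.0, 2.0]
majority_scores = [2.0, 4.0, 6.0, 8.0]
minority_scores = [1.0, 3.0, 5.0, 7.0]

majority_line = fit_line(majority_scores, theta)
minority_line = fit_line(minority_scores, theta)

c_maj = c_index(theta, majority_scores, majority_line)          # 0 by construction
c_min_majority = c_index(theta, minority_scores, majority_line)  # underprediction
c_min_differential = c_index(theta, minority_scores, minority_line)  # 0 again
```

Because a least-squares line reproduces the mean of the criterion in the group on which it was fitted, the index vanishes whenever a subgroup is predicted from its own line.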


As would be expected, the C-Index indicated increased unfairness for the
minority subgroup as item bias was increased from .5 to 2.0. Unfairness also
increased as a negatively accelerating function of test length, reaching its
highest value at a test length of 100. The rate of increase as well as the
highest value varied as a function of item discrimination and degree of item bias.
For both the uniform and peaked tests, increasing item bias tended to increase
the rate of increase of C with test length within a level of item discrimination.
The effect of test length decreased as item discrimination increased.
There appeared to be very little difference between the peaked and uniform
distribution of difficulties on $C_{diff}$ at the a=.3 and .7 levels of item
discrimination. The differences which do occur appear to favor the uniform tests
at the .5 and 1.0 bias levels at a=.3 (Figure 3a) and .7 (Figure 3b), and the
peaked tests at bias of 2.0 when a=.1. However, at the highest discrimination
level, a=1.1 (Figure 3c), the uniform tests were more unfair than the peaked tests
to the minority subgroup when the degree of item bias was large (2.0). For an item
bias of 2.0, differences of .350 and .342 were found between the subgroup C
values for test lengths of 70 and 100 items. Thus for this test situation, using
peaked instead of uniform distributions of difficulty would produce an average
estimate of ability with a decrease in item bias of more than one-third of a
standard deviation relative to the population of true abilities.

Figure 3

C-Index as a function of item discrimination (a),
item bias, and test length, using majority prediction

T-Index

$T_{diff}$ can be defined as the difference between the T-indices for the
majority and minority subgroups, i.e., $T_{diff} = T_{min} - T_{maj}$. A negative
$T_{diff}$ indicated that the percent predicted to be above average was smaller for
the minority than for the majority subgroup; i.e., the test was less fair to the
minority subgroup.


Majority prediction. As Figure 4 shows, using majority prediction,
$T_{diff}$ varied in a complex way as a function of item discrimination, test
length and degree of item bias, for the uniform and peaked tests (numerical values
are given in Appendix Table G). In general, however, the uniform tests were less
unfair to the minority subgroup at the a=.3 and .7 levels of item discrimination
(Figure 4a and 4b, respectively), but showed no clear advantage at the 1.1
level (Figure 4c) except for the shortest test length. Regardless of item
discrimination or degree of item bias, the shortest and longest test lengths of
the uniform test resulted in relatively greater fairness. Only for the
intermediate test lengths did the peaked test sometimes produce a smaller
$T_{diff}$ than did the uniform test, and then usually at the higher discrimination
levels. In contrast to the C-Index, unfairness measured by $T_{diff}$ did not
increase as a regular function of test length for the peaked test at item
discrimination levels above a=.30.

The largest difference in $T_{diff}$ between uniform and peaked tests was 11.2%,
occurring at the highest bias and discrimination levels at test length=10
(Figure 4c). Even for the a=.70, bias=.5, test length=100 condition (Figure 4b), a
test which might be representative of one used in real selection situations, the
uniform test would have led to the selection of 12.8% fewer minority applicants.

Figure 4

T-Index as a function of item discrimination (a),
item bias, and test length, using majority prediction

Differential prediction. The results of using differential prediction on
T-fairness are shown in Figure 5; numerical values are in Appendix Table H.
Since $T_{diff}$ for the differential prediction case used the same T value for the
majority subgroup, but a different value for the minority subgroup, results from
the two prediction situations directly show the reduction in unfairness due to
differential prediction. A comparison of Figure 4 with Figure 5 thus shows that
the main effect of using differential prediction was that a much larger
percentage of minority applicants was predicted above average than was the case
when majority prediction was used. Consequently, the general level of unfairness
was reduced using differential prediction.

Figure 5

T-Index as a function of item discrimination (a),
item bias, and test length, using differential prediction
[Panels (a), (b), and (c); horizontal axis: No. Items]
Figure 5 shows that with differential prediction, the minority subgroup
sometimes had a greater percentage of examinees above the mean than did the
majority subgroup. This is a situation which never occurred in the majority
prediction case (Figure 4). This overprediction for the minority subgroup
occurred almost entirely for the uniform test and tended to decrease as test
length and item discrimination increased. Overprediction virtually disappeared for
test lengths greater than 30, at item discriminations of a=.70 and 1.1 (Figures 5b
and 5c, respectively). On the average, both the peaked and uniform tests tended to
give higher negative values of $T_{diff}$ as item discrimination increased,
indicating increased unfairness, even using differential prediction. This effect
was particularly pronounced for the peaked tests; the unfairness of uniform tests
was less affected by increasing item discrimination.

The uniform tests, with only one exception, produced values of $T_{diff}$ that
were less negatively biased than the peaked tests. This superiority of the
uniform tests increased as the degree of both bias and item discrimination
increased. The difference was particularly large for tests of shorter length.
For a=1.1, bias=2.0 and test length=10, there was a difference between the
uniform and peaked tests of 23.4% in the percentages of minority testees
predicted to be above average.


DISCUSSION 

Effects  of  Item  Characteristics  on  Validity 

There has been considerable previous research (Brogden, 1946; Cronbach &
Warrington, 1952; Gulliksen, 1945; Lord, 1952; Tucker, 1946) on the relationship
between item statistics and test validity. It generally has been shown that
the best distribution of item difficulties for maximizing validity, i.e.,
correlation with underlying true ability, depends on a number of factors including
the level of item discrimination. However, other things being equal (e.g., the
ability distribution peaked near the difficulty level of the items), a higher
validity will be achieved with a peaked distribution of item difficulties than
with a uniform distribution of item difficulties, unless items with very high
discriminations are employed. This result has led many test constructors to
recommend the general use of peaked tests, since the level of item discrimination
at which the uniform test gives higher validity was generally thought to be
too high to occur in realistic testing situations.

However, most of the previous research was conducted using conventional item
statistics. It has been shown (Lord, 1975; Urry, 1974) that conventional item
statistics confound the effects of guessing with item difficulty. When guessing
effects are properly accounted for by using latent trait parameters, the level
of item discrimination at which the uniform test produces higher validity is well
within the range which occurs in common practice. This result was first reported
by Urry (1969, p. 140; 1974) and was reaffirmed in the present study.

At discrimination levels of a=.3 and .7, corresponding to point-biserial
correlations of item response and total score of .187 and .373, respectively,
the peaked test produced a higher validity, although its advantage over the
uniform test tended to decrease with increasing test length. These results are
similar to what has been reported with conventional item statistics. However, at
the a=1.1 level of discrimination (corresponding to point-biserials of .48) and
for tests of 50 items or more, the uniform test produced higher validities than
the peaked test. For a 100-item test at a=1.1, validity was .981 for the uniform
test compared to .967 for the peaked test. This represents a substantial
increase in validity at this high level of correlation. Therefore, it would appear
that the uniform test might be preferable in many practical situations.

Effects of item bias. When test items were biased against the minority
subgroup, validity generally decreased as item bias increased (except at low item
discrimination levels) for both peaked and uniform tests. This effect produced
validity differences between the minority and majority subgroups since items were
unbiased relative to the majority subgroup. Furthermore, these validity
differences increased at a given level of item bias as item discrimination
increased. The implication of these results is that if items are biased, increasing
item discrimination can decrease test fairness as reflected by subgroup validity
differences.


Different types of tests produced different levels of unfairness as measured
by the validity index. Where item discrimination was at least a=.7, the uniform
test was clearly superior to the peaked test in producing a fair test. The
advantage of using the uniform test increased with increasing item discrimination
and test length. With a peaked test, at a=1.1 and a test length of 100, the
minority subgroup had a validity .125 below that of the majority subgroup. Under
these conditions, there was only a .021 difference in subgroup validities when a
uniform test was used.

These results have several implications for the construction of tests and for
the interpretation of existing test data. First, they offer a possible
explanation for the often-reported but controversial phenomenon of differential
validity. Several researchers (Campbell et al., 1973; Farr et al., 1971; Schmidt,
Berner, & Hunter, 1973) have presented arguments, based on various analyses of
empirical data, that differential validity does not exist as a substantive
phenomenon. The results of this study indicate that differential validity is a
definite possibility and, in fact, can be expected when test items are biased
against one of the subgroups being tested. The fact that validity differences are
not often detected in practice may be due to the problem of generating sufficient
statistical power to detect a difference when it exists (Bartlett, Bobko, & Pine,
in press).

Thus, if test items are biased, differential validity is the expected result.
Furthermore, the usual practice of selecting items having the highest item
discriminations will have the effect of increasing subgroup validity differences,
particularly in peaked tests.

Other  Models  of  Selection  Fairness 

In the context of this study, the C-Index, based on Cleary's fairness model,
gave the degree of statistical bias in the estimation of a known criterion value.
The T-Index, based on Thorndike's definition of fairness, reflected the impact
of estimator bias on the percentage of applicants predicted to exceed some
qualifying point of ability, in this case, the mean of the population.

The Cleary view of fairness tends to optimize selection from the vantage
point of the selecting institution since it assures that the ablest candidates
will be selected. The Thorndike model tends to be more liberal from the viewpoint
of the minority subgroup. Even in situations where the Cleary index indicates a
perfectly fair test, it has been previously shown by Schmidt & Hunter (1974) that
the Thorndike index may still indicate unfairness. This result was replicated in
the present study.

Furthermore, both models indicated that the nature of a test, in terms of its
spread of item difficulties, can have a strong effect on fairness at some levels
of item discrimination and for some test lengths. For the levels of
discrimination and test lengths most commonly found in practice, the general
finding was that the peaked test was fairer in terms of the C-Index, while the
uniform test was fairer in terms of the T-Index, when majority prediction was
employed.

The differential prediction condition indicated the conservative nature of
the C-Index. By definition, in this condition, all tests were perfectly fair by
the Cleary model. Yet the T-Index indicated the presence of substantial
unfairness, particularly for very short tests and for highly discriminating tests.
Furthermore, with differential prediction of ability, the uniform distribution of
item difficulties predicted more minority testees to be above average across
nearly all conditions than did the peaked distribution of item difficulties.

C-Index. One of the major trends in the data is shown in Figure 3; for
both the peaked and uniform tests, the effect of item bias on the C-Index
increased with test length. This implies that the shorter a test is, the more fair
it will be in terms of producing a smaller underprediction of the minority ability
level. In other words, shorter tests are less sensitive (more robust) to the
presence of item bias than are longer tests. Unfortunately, this finding runs
contrary both to conventional wisdom and to the results from the validity index,
which indicated an increase in validity with increasing test length.

The reason for this seemingly paradoxical result is that the longer a test
is, the more chance there is for bias to affect the final test score. For
example, if a test is only one item long, the only possible test scores are 0 and
1. Therefore, there is not as much opportunity for bias to affect the test score.
On the other hand, if a test is very long, even a small degree of bias can be
reflected in the score.

The influence of test length on fairness as measured by the C-Index was
reduced, however, by increasing the level of item discrimination. What this
implies is that the length of a test plays a much larger role in the ultimate
fairness of a test at the lower levels of discrimination than it does at the higher
levels. For example, Figure 3 indicates that if item bias is relatively large
(2.0), the extent to which the minority subgroup is underpredicted will vary
from 1 to 1.5 standard deviations as test length increases from 30 to 100 items.
At the highest level of discrimination, however, the increase in underprediction
is relatively constant between these test lengths. Consequently, in order to
achieve a high level of validity and the smallest possible underprediction of the
minority subgroup, the highest possible level of item discrimination should be
maintained, particularly for short tests.

If a test uses highly discriminating items, the distribution of item
difficulties will become an important factor in test fairness as measured by the
C-Index. For highly discriminating items, if there is reason to suspect a
relatively high degree of item bias, the results of this study indicate that a
peaked test is to be preferred over a uniform test. Unfortunately, this
conclusion conflicts with the findings based on the validity data, where it was
found that a uniform test produced the smallest difference in validities with
highly discriminating items. Apparently, a decision must be made as to which
criterion is most important in a given situation: reduction in the difference
between subgroup validities, or reduction in the underprediction of the minority
subgroup.

In making this decision, the test constructor must carefully consider the
degree of precision which must be sacrificed in order to reduce the relative
degree of unfairness to a minority subgroup. Some minimum degree of precision
must surely be maintained or one could end up with a perfectly fair, but totally
useless selection instrument. This situation would, for example, be approached
by employing very short tests using items with very low discrimination.

T-Index. As was the case with the C-Index, increasing average item
discrimination had the overall effect of increasing unfairness as measured by
the T-Index. The relationship between fairness as measured by the T-Index and
test length, however, was more complicated than it was when fairness was
measured by the C-Index. For some levels of item discrimination, T-fairness
increased with test length, while in other cases it decreased. In general,
however, the fairest tests were the shortest tests using the least
discriminating items. This is the same result found for the C-Index and was,
again, probably due to the restriction in the number of unique scores possible
and the increased unreliability characteristic of a short test.

Results  for  the  T-Index  indicated  that  the  uniform  test  was  consistently 
less  adversely  affected  by  item  bias  than  was  the  peaked  test  for  the  lower 
levels  of  item  discrimination.  However,  at  higher  item  discriminations,  neither 
test  design  was  obviously  favorable. 

Implications. Some generalizations about test design can be made based on
these results. Specifically, at moderate levels of item discrimination and test
lengths above 50 items, uniform tests are clearly superior to peaked tests in
terms of reducing unfairness. This conclusion holds for all three fairness
indices. At high item discrimination levels (above a=1.1), where uniform and
peaked tests produced conflicting results in terms of validity and C-fairness,
the relationship between distribution of item difficulties and fairness is less
clear. At these levels, the distribution of item difficulties does not seem to
make much difference as long as the tests are at least moderately long (greater
than 30 items). Also, at these high levels of item discrimination, the expected
loss in relative test validity for the minority subgroup would be small.
Therefore, in view of the superiority of peaked tests in terms of C-fairness
under these conditions, they would generally be preferable to the uniform tests.

Differential  Prediction 

When differential prediction is used, a test will always be fair in terms
of Cleary's definition of fairness. That is, there will be no overprediction or
underprediction of mean ability level for either subgroup. Similarly, within the
model used in this study, the use of differential prediction will not be
reflected in the R-Index, since it amounts to adding a constant to the scores of
the minority subgroup. Such a constant will not change the correlation of test
scores with another variable.
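
The invariance of a correlation under an added constant can be checked
numerically. The sketch below is only an illustration with made-up data: the
sample size, the noise level, and the 0.5 shift are hypothetical, and
`pearson_r` is a plain textbook implementation rather than anything taken from
the study.

```python
import math
import random


def pearson_r(x, y):
    """Pearson product-moment correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


rng = random.Random(0)
criterion = [rng.gauss(0.0, 1.0) for _ in range(500)]

# Hypothetical minority-subgroup scores, depressed by 0.5 to mimic biased items.
scores = [c + rng.gauss(0.0, 1.0) - 0.5 for c in criterion]

# Adding a constant back to every score shifts the mean only.
adjusted = [s + 0.5 for s in scores]

r_before = pearson_r(scores, criterion)
r_after = pearson_r(adjusted, criterion)
print(round(r_before, 6), round(r_after, 6))
```

The two printed correlations agree, while the subgroup mean moves by the full
amount of the added constant.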

However, a test may be unfair according to the Thorndike definition of
unfairness in the differential prediction condition. The degree of unfairness
will depend on the item discrimination level, test length and distribution of
item difficulties. As was the case for C-fairness and for T-fairness using
majority prediction, differential prediction was accompanied by an overall
decrease in fairness to the minority subgroup as average item discriminations
increased. The relationship between T-fairness and test length, however, was
much more pronounced in the differential prediction case. The distribution of
item difficulties also had a much larger effect in the differential prediction
condition.

The most interesting effect was due to distribution of item difficulties.
The uniform tests resulted in scores which were more fair to the minority
subgroup than were scores on the peaked tests for almost all test lengths and
degrees of item bias. The differences in T-fairness between the uniform and
peaked tests were particularly large at the shortest and longest test lengths.
At the highest level of item discrimination (a=1.1), the uniform tests showed a
clear and substantial advantage over the peaked tests.

The differences that occurred between the uniform and peaked tests in the
differential prediction condition were mainly due to the skewness and kurtosis
of the predicted score distributions obtained in the respective conditions. As
can be seen in Table 2, the uniform tests produced a predicted score distribution
that was flatter and less skewed than that of the peaked tests. These
differences in the shape of the predicted score distributions increased as item
discrimination was increased.
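
The two shape statistics can be computed from central moments. The sketch below
is a minimal illustration on made-up data. It assumes the kurtosis values in
the appendix are excess kurtosis (zero for a normal distribution), which the
near-zero value reported for the true ability distribution suggests but this
section does not state.

```python
def central_moment(xs, k):
    """k-th central moment of a sample (population form, divisor n)."""
    mean = sum(xs) / len(xs)
    return sum((x - mean) ** k for x in xs) / len(xs)


def skewness(xs):
    """Third standardized moment; zero for a symmetric distribution."""
    return central_moment(xs, 3) / central_moment(xs, 2) ** 1.5


def excess_kurtosis(xs):
    """Fourth standardized moment minus 3; negative for flat distributions."""
    return central_moment(xs, 4) / central_moment(xs, 2) ** 2 - 3.0


# Hypothetical predicted scores: one flat and symmetric, one piled up low
# with a long upper tail.
flat = [i / 10 for i in range(-20, 21)]
skewed = [0, 0, 0, 0, 1, 1, 2, 5]

print(round(skewness(flat), 3), round(excess_kurtosis(flat), 3))
print(round(skewness(skewed), 3), round(excess_kurtosis(skewed), 3))
```

A flat, symmetric set of scores gives skewness near zero and negative excess
kurtosis; scores bunched at the low end give positive skewness.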

The effect of the shape of the predicted score distribution is much greater
in the differential prediction condition than in the majority prediction
condition because of the relationship between the means of the score
distributions and the selection cutoff. These effects can be seen in
Figures 4 and 5. Figure 4 represents the case where majority prediction was used
and the test items were biased against the minority subgroup. This situation
will result in the mean predicted ability of the minority subgroup being below
that of the majority subgroup. Since, in this case, such a small percentage of
the minority subgroup is above the majority subgroup average, differences in the
predicted distributions as a function of spread in item difficulties have a
relatively small effect on T-fairness. However, with differential prediction
(Figure 5) there will be no bias in the predicted means for either subgroup.
Consequently, the effects of skewness and kurtosis on T-fairness are much larger.
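
The role of distribution shape can be illustrated with a toy calculation. The
sketch assumes that T-fairness compares the percentage of each subgroup whose
predicted score exceeds the selection cutoff, which is one reading of the
Thorndike model when the subgroups do not differ on the criterion. The scores,
the cutoff, and the names `percent_above` and `t_index` are all hypothetical.

```python
def percent_above(scores, cutoff):
    """Percentage of scores strictly above the selection cutoff."""
    return 100.0 * sum(s > cutoff for s in scores) / len(scores)


def t_index(minority_scores, majority_scores, cutoff):
    """Minority percentage selected minus majority percentage selected."""
    return percent_above(minority_scores, cutoff) - percent_above(majority_scores, cutoff)


# Two hypothetical predicted score distributions with the same mean (0.0),
# so neither subgroup is over- or underpredicted on average.
majority = [-2, -1, 0, 1, 2]      # symmetric
minority = [-1, -1, -1, -1, 4]    # positively skewed

cutoff = 0.5
print(percent_above(majority, cutoff))        # 40.0
print(percent_above(minority, cutoff))        # 20.0
print(t_index(minority, majority, cutoff))    # -20.0
```

Although the two means are equal, the skewed distribution places fewer cases
above the cutoff, so the difference in selection rates comes entirely from
shape.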

When differential prediction was used, the uniform test was fairer to
the minority subgroup than was the peaked test. This result was observed across
test length and item discrimination conditions. For the higher discrimination
levels, this result was consistent with the results from the validity data.
Therefore, uniform tests are clearly preferable when used in combination with
differential prediction. These results also imply that if differential
prediction is employed, it is possible to avoid the problem, often encountered
using majority prediction, of trying to simultaneously minimize differential
validity and C- or T-fairness.


SUMMARY  AND  CONCLUSIONS 

This study was concerned with how test fairness, defined in terms of test
validity and the models presented by Cleary and Thorndike, is influenced by
test length, distribution of item difficulties, level of item discrimination
and degree of item bias. The methodology involved computer simulation in which
bias and fairness were represented in the context of latent trait theory. This
approach eliminates many of the criterion measurement problems often present in
empirical validation studies, and allows direct observation of the influence of
item characteristics on test scores and on predictions made from those test
scores. The situation assumed in the present study was that a single test was
used to select an unrestricted sample of applicants from a hypothetical
population consisting of a minority and a majority subgroup. The criterion on
which the selections were validated was a unidimensional variable on which the
subgroups had identical distributions.
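
A simulation of this general kind can be sketched in a few lines. The version
below is an illustration only: it assumes a two-parameter logistic response
model, standard normal ability, and item bias represented as a constant added
to every item's difficulty for minority examinees. None of these choices is
confirmed by this section, and all names and parameter values are hypothetical.

```python
import math
import random


def p_correct(theta, a, b):
    """Two-parameter logistic probability of a correct response (assumed model)."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))


def number_correct(thetas, difficulties, a, bias, rng):
    """Simulate number-correct scores; `bias` makes every item harder by that amount."""
    return [
        sum(rng.random() < p_correct(theta, a, b + bias) for b in difficulties)
        for theta in thetas
    ]


def mean(xs):
    return sum(xs) / len(xs)


rng = random.Random(1)
n_items = 30
uniform_b = [-3.0 + 6.0 * i / (n_items - 1) for i in range(n_items)]  # wide spread
peaked_b = [0.0] * n_items                                            # all at the mean

# Both subgroups share one ability distribution, as in the study's design.
thetas = [rng.gauss(0.0, 1.0) for _ in range(2000)]

results = {}
for name, difficulties in (("uniform", uniform_b), ("peaked", peaked_b)):
    majority = number_correct(thetas, difficulties, 1.1, 0.0, rng)
    minority = number_correct(thetas, difficulties, 1.1, 1.0, rng)  # hypothetical bias
    results[name] = (mean(majority), mean(minority))
    print(name, round(results[name][0], 2), round(results[name][1], 2))
```

Because ability is identical across the two simulated subgroups, any gap
between the printed means is attributable to the bias term alone.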

Validity 

The findings from the validity data indicated that, contrary to the results
of previous research, a uniform test often led to a higher validity for many
practical test applications than did a peaked test. In fact, if item
discriminations were relatively high, uniform tests resulted in substantially
higher validities than did peaked tests. More importantly, with respect to the
issue of test fairness, the difference between subgroup validities could be
reduced by using uniform rather than peaked tests. It was also found that
validity differences, such as those reported and often disputed in the testing
literature, are to be expected when test items are biased against one of the
applicant subgroups. The fact that such validity differences are not always
found in empirical validation studies is probably due to the lack of power in
the statistical tests used in these empirical investigations.
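
The power argument can be made concrete with the usual large-sample test for a
difference between two independent correlations based on Fisher's z
transformation. The sketch below is a generic calculation, not the study's
analysis; the correlations and sample sizes are hypothetical.

```python
import math
from statistics import NormalDist


def power_diff_correlations(r1, n1, r2, n2, alpha=0.05):
    """Approximate power of the two-sided Fisher z test of rho1 == rho2."""
    z1, z2 = math.atanh(r1), math.atanh(r2)          # Fisher z transforms
    se = math.sqrt(1.0 / (n1 - 3) + 1.0 / (n2 - 3))  # standard error of z1 - z2
    delta = abs(z1 - z2) / se                        # standardized true difference
    normal = NormalDist()
    z_crit = normal.inv_cdf(1.0 - alpha / 2.0)
    return normal.cdf(delta - z_crit) + normal.cdf(-delta - z_crit)


# Hypothetical validities of .50 (majority) and .40 (minority).
small_minority = power_diff_correlations(0.50, 400, 0.40, 60)
large_minority = power_diff_correlations(0.50, 400, 0.40, 400)
print(round(small_minority, 3), round(large_minority, 3))
```

With a small minority sample the chance of detecting a validity difference of
this size is low, and it rises as the minority sample grows.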

Selection  Fairness  Models 


The  shapes  of  both  the  subgroup  score  distributions  and  the  predicted  ability 
distributions  were  found  to  be  very  much  affected  by  the  characteristics  of  the 
items  included  in  the  selection  instrument.  Conclusions  drawn  from  each  of  the 
models  used  for  measuring  selection  fairness  were  a function  of  the  predicted 
ability  distributions.  Consequently,  selection  fairness  was  found  to  be  a 
function  of  a test's  item  characteristics  as  well. 

Perhaps the most relevant finding for test construction was that certain
combinations of item characteristics were more robust in the presence of item bias
than  were  others.  That  is,  item  bias  had  less  of  an  effect  on  fairness  for  some 
combinations  of  item  discrimination,  test  lengths,  and  distribution  of  item 
difficulties,  than  for  others.  The  relationships  among  these  variables  were 
very  complex.  In  any  practical  application  where  it  is  necessary  to  know  how  a 
particular  set  of  item  characteristics  will  affect  the  fairness  of  a test,  a 
simulation  study  should  be  implemented  in  which  the  conditions  of  the  application 
are  approximated  as  closely  as  possible. 
Nevertheless, certain generalizations can be made based on the present
results. If applicants are to be selected in a situation similar to the
conditions assumed in this study, a test having a uniform spread of item
difficulties will result in fairer predictions than will a peaked test, if a
reasonably high level of item discrimination can be maintained. Also, the
differential prediction model can be expected to provide fairer selection than
will sole reliance on majority prediction equations. Furthermore, the advantages
of using a uniform test will be enhanced in the differential prediction
application.

The results from the differential prediction condition indicate the
conservative nature of Cleary's fairness model as compared to Thorndike's model. The
use  of  differential  prediction  results  in  tests  that  are  perfectly  fair  according 
to  the  Cleary  definition,  yet  substantial  amounts  of  unfairness  were  indicated 
in  terms  of  the  Thorndike  model.  This  is  a phenomenon  often  reported  in  the 
literature  on  models  of  fairness;  different  models  of  fairness  can  sometimes  lead 
to  divergent  implications  about  the  fairness  of  a test  in  a given  selection 
situation.  Particularly  when  peaked  tests  are  employed,  these  two  fairness 
models  will  lead  to  different  conclusions. 

Future  Research 


The present study investigated only a limited class of test instruments.
The conventional tests used are characterized by their use of an identical fixed
sequence of items for all testees. Recently, a number of adaptive testing models
have been developed as alternatives to the conventional model (see Weiss, 1974).
In adaptive tests, items are selected on an individual basis for each testee.
Research with adaptive tests (e.g., McBride & Weiss, 1976; Vale & Weiss, 1975)
has shown that they result in different score distributions than do conventional
tests, with true ability held constant. Consequently, adaptive testing methods
might result in different degrees of fairness in test scores. Future research
should explore the fairness properties of adaptive testing models and compare
them with those of conventional tests.

APPENDIX


Table  A 

Means  and  Standard  Deviations  of 
Item  Difficulty  Distributions  of  Item  Banks 


PEAKED TEST

                    Item Discrimination (a)
Test            .3              .7              1.1
Length       M    S.D.       M    S.D.       M    S.D.
  10       .00    .09      [?]    .12     -.07    .11
  30       [?]    .08      .02    .11     -.02    .10
  50       .00    .09     -.01    .11      .00    [?]
  70       .01    .09      .00    .11      .00    .10
 100       .00    .09      .01    .11      .00    .10

UNIFORM TEST

                    Item Discrimination (a)
Test            .3              .7              1.1
Length       M    S.D.       M    S.D.       M    S.D.
  10      -.32   1.82     -.32   1.82     -.32   1.82
  30      -.02   1.79     -.02   1.79     -.02   1.79
  50      -.07   1.71     -.07   1.71     -.07   1.71
  70      -.17   1.70     -.17   1.70     -.17   1.70
 100      -.13   1.77     -.13   1.77     -.13   1.77

Table  B 

Score Distribution Characteristics for Conventional Tests of Length 10, as a
Function of Discrimination (a), Bias, and Group, for Uniform and Peaked Tests

                       Mean             Standard Deviation           Skewness        Kurtosis
                                      Uniform         Peaked
a    Bias  Group  Uniform  Peaked    M.P.    D.P.    M.P.    D.P. Uniform  Peaked Uniform  Peaked
           True     -.074   -.074   1.006   1.006   1.006   1.006    -.01    -.01     .22     .22
.30        maj      -.074   -.074    .496    .496    .544    .544    -.23      14    -.22    -.24
     .5    min      -.198   -.198    .507    .495    .540    .546    -.24    -.05    -.08    -.26
     1     min      -.328   -.338    .507    .515    .554    .588    -.13    -.01    -.13    -.37
     2     min      -.583   -.604    .500    .526    .528    .543     .09     .30    -.16    -.28
.70        maj      -.074   -.074    .750    .750    .787    .787    -.27    -.17    -.32    -.70
     .5    min      -.357   -.393    .7?3    .749    .777    .801    -.21     .22    -.34    -.62
     1     min      -.875   -.697    .786    .769    .751    .806    -.01     .59    -.37    -.35
     2     min     -1.215  -1.216    .757    .777    .613    .761     .34    1.14    -.27     .88
1.10       maj      -.074   -.074    .825    .825    .874    .874    -.41    -.16     .12   -1.06
     .5    min      -.435   -.485    .880    .834    .877    .885    -.22     .32    -.30   -1.00
     1     min      -.813   -.854    .920    .849    .810    .858    -.14     .81    -.41    -.31
     2     min     -1.554  -1.372    .869    .829    .570    .758     .50    2.00    -.35    3.92

Note. M.P. is majority prediction equation; D.P. is differential prediction equation.

Table C

Score  Distribution  Characteristics  for  Conventional  Tests  of  Length  30,  as  a 
Function  of  Discrimination  (a),  Bias,  and  Group,  for  Uniform  and  Peaked  Tests 


Note.  M.P.  is  majority  prediction  equation;  D.P.  is  differential  prediction  equation. 


Table  D 

Score Distribution Characteristics for Conventional Tests of Length 70, as a
Function of Discrimination (a), Bias, and Group, for Uniform and Peaked Tests

                       Mean             Standard Deviation           Skewness        Kurtosis
                                      Uniform         Peaked
a    Bias  Group  Uniform  Peaked    M.P.    D.P.    M.P.    D.P. Uniform  Peaked Uniform  Peaked
           True     -.074   -.074   1.006   1.006   1.006   1.006    -.01    -.01     .22     .22
.30        maj      -.074   -.074    .851    .851    .853    .853    -.00    -.10    -.13    -.04
     .5    min      -.429   -.445    .876    .858    .864    .865     .03     .04    -.21    -.04
     1     min      -.781   -.810    .878    .860    .872    .865     .09     .11    -.16    -.07
     2     min     -1.488  -1.480    .882    .861    .835    .860     .22     .34    -.09     .03
.70        maj      -.074   -.074    .960    .960    .960    .960    -.15    -.11    -.25    -.67
     .5    min      -.508   -.538    .967    .959    .953    .961    -.02     .22    -.32    -.64
     1     min      -.977   -.978    .966    .954    .908    .954     .20     .53    -.29    -.29
     2     min     -1.828  -1.732    .973    .938    .719    .916     .52    1.16    -.03    1.17
1.10       maj      -.074   -.074    .979    .979    .967    .967    -.19    -.11     .04   -1.10
     .5    min      -.551   -.541   1.003    .977    .952    .963    -.15     .37    -.32    -.90
     1     min     -1.048   -.973    .988    .857    .857    .942     .20     .88    -.42    -.02
     2     min     -1.945  -1.595    .879    .955    .551    .842     .68    2.13     .16    4.86

Note. M.P. is majority prediction equation; D.P. is differential prediction equation.

Table E

Score Distribution Characteristics for Conventional Tests of Length 100, as a
Function of Discrimination (a), Bias, and Group, for Uniform and Peaked Tests

                       Mean             Standard Deviation           Skewness        Kurtosis
                                      Uniform         Peaked
a    Bias  Group  Uniform  Peaked    M.P.    D.P.    M.P.    D.P. Uniform  Peaked Uniform  Peaked
           True     -.074   -.074   1.006   1.006   1.006   1.006    -.01    -.01     .22     .22
.30        maj      -.074   -.074    .889    .889    .893    .893    -.06    0.16    -.05     .07
     .5    min      -.456   -.479    .903    .892    .918    .902    -.00    -.05    -.13    -.02
     1     min      -.837   -.880    .904    .893    .914    .898     .05     .08    -.04    -.05
     2     min     -1.606  -1.638    .911    .891    .875    .895     .22     .28    -.09    -.03
.70        maj      -.074   -.074    .972    .972    .972    .972    -.14    -.13    -.24     .65
     .5    min      -.526   -.553    .971    .971    .969    .973     .03     .19    -.24    -.64
     1     min      -.993  -1.011    .962    .966    .922    .965     .19     .52    -.22    -.28
     2     min     -1.871  -1.782    .866    .955    .727    .930     .53    1.14     .00    1.08
1.10       maj      -.074   -.074    .987    .987    .972    .972    -.10    -.12    -.10   -1.07
     .5    min      -.554   -.548   1.004    .985    .960    .968     .06     .37    -.23    -.90
     1     min     -1.049   -.985    .981    .981    .863    .947     .19     .88    -.36    -.05
     2     min     -1.958  -1.616    .875    .966    .554    .846     .63    2.16     .14    5.03

Note. M.P. is majority prediction equation; D.P. is differential prediction equation.


Table  F 

C-Index for Uniform (U) and Peaked (P) Tests

                                                   Test Length
                        10              30              50              70              100
a    Bias  Group        U       P       U       P       U       P       U       P       U       P
.30  0.0   maj       .000    .000    .000    .000    .000    .000    .000    .000    .000    .000
     .5    min      -.124   -.124   -.258   -.255   -.317   -.328   -.355   -.371   -.382   -.405
           diff     -.124   -.124   -.258   -.255   -.317   -.328   -.355   -.371   -.382   -.405
     1.0   min      -.254   -.264   -.527   -.534   -.635   -.664   -.707   -.736   -.763   -.806
           diff     -.254   -.264   -.527   -.534   -.635   -.664   -.707   -.736   -.763   -.806
     2.0   min      -.509   -.530  -1.033  -1.023  -1.262  -1.286  -1.414  -1.406  -1.532  -1.564
           diff     -.509   -.530  -1.033  -1.023  -1.262  -1.286  -1.414  -1.406  -1.532  -1.564
.70  0.0   maj       .000    .000    .000    .000    .000    .000    .000    .000    .000    .000
     .5    min      -.283   -.319   -.377   -.434   -.422   -.461   -.434   -.464   -.452   -.479
           diff     -.283   -.319   -.377   -.434   -.422   -.461   -.434   -.464   -.452   -.479
     1.0   min      -.586   -.623   -.801   -.837   -.879   -.894   -.903   -.904   -.919   -.937
           diff     -.586   -.623   -.801   -.837   -.879   -.894   -.903   -.904   -.919   -.937
     2.0   min     -1.141  -1.142  -1.533  -1.524  -1.675  -1.634  -1.754  -1.658  -1.797  -1.708
           diff    -1.141  -1.142  -1.533  -1.524  -1.675  -1.634  -1.754  -1.658  -1.797  -1.708
1.10 0.0   maj       .000    .000    .000    .000    .000    .000    .000    .000    .000    .000
     .5    min      -.361   -.411   -.433   -.452   -.462   -.462   -.477   -.467   -.480   -.474
           diff     -.361   -.411   -.433   -.452   -.462   -.462   -.477   -.467   -.480   -.474
     1.0   min      -.739   -.780   -.901   -.861   -.946   -.882   -.974   -.899   -.975   -.911
           diff     -.739   -.780   -.901   -.861   -.946   -.882   -.974   -.899   -.975   -.911
     2.0   min     -1.480  -1.298  -1.760  -1.470  -1.778  -1.499  -1.871  -1.521  -1.884  -1.542
           diff    -1.480  -1.298  -1.760  -1.470  -1.778  -1.499  -1.871  -1.521  -1.884  -1.542


Table  G 


0.0   maj       38.4    56.8    45.4    47.4    41.6    43.6    44.0    46.2    44.0    49.8
.5    min       30.4    46.2    31.8    33.8    28.0    28.2    29.6    30.2    30.6    30.8
      diff      -8.0   -10.6   -13.6   -13.6   -13.6   -15.6   -14.4   -16.0   -13.4   -19.0
1.0   min       21.6    37.4    20.4    21.4    18.0    17.2    18.2    19.0    17.2    19.2
      diff     -16.8   -19.4   -25.0   -26.0   -23.6   -26.6   -25.8   -27.2   -26.8   -30.6
2.0   min       10.0    20.2     7.8     7.2     4.4     3.6     4.4     4.6     4.6     4.6
      diff     -28.4   -36.6   -37.6   -40.2   -37.2   -40.2   -39.6   -41.6   -39.4   -45.2

0.0   maj       40.6    53.0    50.2    45.0    46.2    49.4    48.0    49.8    48.2    49.2
.5    min       26.6    35.4    34.4    29.2    28.4    30.4    32.0    29.8    31.0    29.2
      diff     -14.0   -17.6   -15.8   -15.8   -17.8   -19.0   -16.0   -20.0   -17.2   -20.0
1.0   min       16.4    22.8    19.6    15.0    14.8    15.6    16.2    15.4    15.4    14.8
      diff     -24.2   -30.2   -30.6   -30.0   -31.4   -33.8   -31.8   -34.4   -32.8   -34.4
2.0   min        4.6     6.2     3.8     3.6     2.8     3.6     2.4     3.4     3.0     3.2
      diff     -36.0   -46.8   -46.4   -41.4   -43.4   -45.8   -45.6   -46.4   -45.2   -46.0

0.0   maj       40.0    52.6    47.2    46.8    47.8    47.6    47.8    48.4    48.8    50.0
.5    min       25.2    35.2    29.6    29.4    28.0    29.2    27.8    29.2    29.0    29.2
      diff     -14.8   -17.4   -17.6   -17.4   -19.8   -18.4   -20.0   -19.2   -19.8   -20.8
1.0   min       15.0    20.4    17.4    15.2    15.8    14.6    13.8    15.4    15.4    15.4
      diff     -25.0   -32.2   -29.3   -31.6   -32.0   -33.0   -34.0   -33.0   -33.4   -34.6
2.0   min        3.4     4.8     3.0     3.4     2.4     3.4     2.2     3.0     2.6     2.6
      diff     -36.6   -47.8   -44.2   -43.4   -45.4   -44.2   -45.6   -45.4   -46.2   -47.4